Continuing my SQL projects, I will be performing data cleaning in a more advanced way. The data set that I will be cleaning have the information about Nashville Housing, that contains information about the Property Address, Sales Date, Sales Price, Owner Name and much more.
In order to use SQL databases, you must establish connection with Microsoft SQL Server. One way to do that, is using “dbConnect”:
Let’s take a first look into our data:
SELECT *
FROM "Portfolio"."dbo"."nashville"ALTER TABLE Portfolio.dbo.nashville
ADD SaleDateConverted Date;
UPDATE Portfolio.dbo.nashville
SET SaleDateConverted = CONVERT(Date,SaleDate)SELECT SaleDateConverted, SaleDate
FROM Portfolio.dbo.nashvilleFirst let’s take a look if there is missing values on the PropertyAddress column.
SELECT PropertyAddress
FROM "Portfolio"."dbo"."nashville"
WHERE PropertyAddress IS NULL| PropertyAddress |
|---|
SELECT ParcelID, PropertyAddress
FROM "Portfolio"."dbo"."nashville"
ORDER BY ParcelIDSelect a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, ISNULL(a.PropertyAddress,b.PropertyAddress) AS NewPropertyAddress
From "Portfolio"."dbo"."nashville" a
JOIN "Portfolio"."dbo"."nashville" b
on a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
Where a.PropertyAddress is nullUpdate a
SET PropertyAddress = ISNULL(a.PropertyAddress,b.PropertyAddress)
From "Portfolio"."dbo"."nashville" a
JOIN "Portfolio"."dbo"."nashville" b
on a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
Where a.PropertyAddress is nullSelect a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, ISNULL(a.PropertyAddress,b.PropertyAddress) AS NewPropertyAddress
From "Portfolio"."dbo"."nashville" a
JOIN "Portfolio"."dbo"."nashville" b
on a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
Where a.PropertyAddress is null| ParcelID | PropertyAddress | ParcelID | PropertyAddress | NewPropertyAddress |
|---|
In our column, PropertyAddress, we can see that there a comma, separating the address from the city. Let’s create a new column to get the name of the cities.
ALTER TABLE Portfolio.dbo.nashville
ADD PropertySplitAddress Nvarchar(255);
Update Portfolio.dbo.nashville
SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) -1 )ALTER TABLE Portfolio.dbo.nashville
ADD PropertySplitCity Nvarchar(255);
Update Portfolio.dbo.nashville
SET PropertySplitCity = SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1 , LEN(PropertyAddress))Select PropertySplitAddress, PropertySplitCity
From Portfolio.dbo.nashvilleWe also can do something very similar to get information on Address, City and State for the OwnerAddress column. But first, let’s look on this column:
SELECT OwnerAddress
FROM "Portfolio"."dbo"."nashville"| OwnerAddress |
|---|
| 1808 FOX CHASE DR, GOODLETTSVILLE, TN |
| 1832 FOX CHASE DR, GOODLETTSVILLE, TN |
| 1864 FOX CHASE DR, GOODLETTSVILLE, TN |
| 1853 FOX CHASE DR, GOODLETTSVILLE, TN |
| 1829 FOX CHASE DR, GOODLETTSVILLE, TN |
| 1821 FOX CHASE DR, GOODLETTSVILLE, TN |
| 2005 SADIE LN, GOODLETTSVILLE, TN |
| 1917 GRACELAND DR, GOODLETTSVILLE, TN |
| 1428 SPRINGFIELD HWY, GOODLETTSVILLE, TN |
| 1420 SPRINGFIELD HWY, GOODLETTSVILLE, TN |
PARSENAME to get this information:-- Address
ALTER TABLE Portfolio.dbo.nashville
Add OwnerSplitAddress Nvarchar(255);
Update Portfolio.dbo.nashville
SET OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 3)-- City
ALTER TABLE Portfolio.dbo.nashville
Add OwnerSplitCity Nvarchar(255);
Update Portfolio.dbo.nashville
SET OwnerSplitCity = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 2)-- State
ALTER TABLE Portfolio.dbo.nashville
Add OwnerSplitState Nvarchar(255);
Update Portfolio.dbo.nashville
SET OwnerSplitState = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 1)SELECT OwnerSplitAddress, OwnerSplitCity, OwnerSplitState
FROM Portfolio.dbo.NashvilleLet’s change the values on SoldAsVacant column: Y -> Yes and N -> No.
SELECT DISTINCT(SoldAsVacant), Count(SoldAsVacant)
FROM "Portfolio"."dbo"."nashville"
Group by SoldAsVacant
order by 2| SoldAsVacant | |
|---|---|
| Yes | 4669 |
| No | 51704 |
Update Portfolio.dbo.nashville
SET SoldAsVacant = CASE When SoldAsVacant = 'Y' THEN 'Yes'
When SoldAsVacant = 'N' THEN 'No'
ELSE SoldAsVacant
ENDSELECT DISTINCT(SoldAsVacant), Count(SoldAsVacant)
FROM "Portfolio"."dbo"."nashville"
Group by SoldAsVacant
order by 2| SoldAsVacant | |
|---|---|
| Yes | 4669 |
| No | 51704 |
Let’s find out all the duplicate values in our data set:
WITH RowNumCTE AS(
Select *,
ROW_NUMBER() OVER (
PARTITION BY ParcelID,
PropertyAddress,
SalePrice,
SaleDate,
LegalReference
ORDER BY
UniqueID
) row_num
FROM "Portfolio"."dbo"."nashville"
)
Select *
From RowNumCTE
Where row_num > 1
Order by PropertyAddressAnd delete them:
WITH RowNumCTE AS(
Select *,
ROW_NUMBER() OVER (
PARTITION BY ParcelID,
PropertyAddress,
SalePrice,
SaleDate,
LegalReference
ORDER BY
UniqueID
) row_num
FROM "Portfolio"."dbo"."nashville"
)
DELETE
From RowNumCTE
Where row_num > 1WITH RowNumCTE AS(
Select *,
ROW_NUMBER() OVER (
PARTITION BY ParcelID,
PropertyAddress,
SalePrice,
SaleDate,
LegalReference
ORDER BY
UniqueID
) row_num
FROM "Portfolio"."dbo"."nashville"
)
Select *
From RowNumCTE
Where row_num > 1
Order by PropertyAddressALTER TABLE Portfolio.dbo.nashville
DROP COLUMN OwnerAddress, TaxDistrict, PropertyAddress, SaleDate;